Using genome and transcriptome data from African-ancestry female participants to identify putative breast cancer susceptibility genes

African-ancestry (AA) participants are underrepresented in genetics research. Here, we conducted a transcriptome-wide association study (TWAS) in AA female participants to identify putative breast cancer susceptibility genes. We built genetic models to predict levels of gene expression, exon junction, and 3′ UTR alternative polyadenylation using genomic and transcriptomic data generated in normal breast tissues from 150 AA participants and then used these models to perform association analyses using genomic data from 18,034 cases and 22,104 controls. At Bonferroni-corrected P < 0.05, we identified six genes associated with breast cancer risk, including four genes not previously reported (CTD-3080P12.3, EN1, LINC01956 and NUP210L). Most of these genes showed a stronger association with risk of estrogen-receptor (ER) negative or triple-negative than ER-positive breast cancer. We also replicated the associations with 29 genes reported in previous TWAS at P < 0.05 (one-sided), providing further support for an association of these genes with breast cancer risk. Our study sheds new light on the genetic basis of breast cancer and highlights the value of conducting research in AA populations.
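The Bonferroni criterion referred to above can be illustrated with a minimal Python sketch. It assumes that the correction divides the nominal alpha of 0.05 by the number of tests performed. The function names, gene labels, and p-values below are hypothetical, and the number of tests in the actual analysis is not taken from this study.

```python
from __future__ import annotations


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def significant_genes(p_values: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Return genes whose p-value passes the Bonferroni threshold.

    The number of tests is assumed to equal the number of genes supplied.
    """
    threshold = bonferroni_threshold(alpha, len(p_values))
    return sorted(gene for gene, p in p_values.items() if p < threshold)


# Hypothetical example: three tested genes with made-up p-values.
example_p_values = {"GENE_A": 1e-6, "GENE_B": 0.02, "GENE_C": 0.3}
example_hits = significant_genes(example_p_values)
```

Requiring the raw p-value to fall below alpha divided by the number of tests is equivalent to requiring the Bonferroni-corrected p-value to fall below alpha.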


The United States Radiologic Technologists (USRT) cohort:
The USRT is a cohort study of cancer incidence and mortality that recruited approximately 140,000 U.S. radiologic technologists who were certified for at least two years between 1926 and 1982.¹³ Breast cancer cases were confirmed based on pathology or medical records.

The Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO):
The PLCO is a multicenter, two-armed, randomized trial designed to evaluate the efficacy of screening for prostate, lung, colorectal, and ovarian cancer.¹⁴ It recruited approximately 155,000 men and females, aged 55-74 years, from 1993 to 2001.

Nigerian Breast Cancer Study (NBCS):
The NBCS is an ongoing case-control study of breast cancer in Ibadan, Nigeria, initiated in 1998.¹⁵,¹⁶ Breast cancer cases were 20 years or older and were ascertained at the University College Hospital, Ibadan, which is the oldest tertiary hospital in Nigeria, with a catchment population of approximately three million. Controls were recruited from a community randomly selected from those adjoining the hospital. The majority of the study subjects were Yoruba, one of the populations selected by the International HapMap Project to represent the African continent.

Chicago Cancer Prone Study (CCPS):
The CCPS is a hospital-based case-control study designed to investigate the genetics of young-onset breast cancer. Cases with histologically confirmed breast cancer were enrolled through the Cancer Risk Clinic at the University of Chicago. Young-onset cases and African Americans were oversampled. Controls were gender- and age-matched with cases and enrolled from patients who visited the same hospital and were willing to donate blood for genetic studies.

Women of African Ancestry Breast Cancer Study (WAABCS):
The WAABCS is a hospital-based case-control study originally started in Nigeria in 1998 and expanded to Uganda and Cameroon in 2011 with the same questionnaires and protocol.¹⁷,¹⁸

Carolina Breast Cancer Study (CBCS):
The CBCS is a population-based case-control study conducted in 24 counties of central and eastern North Carolina.¹⁹ From 1993 to 2001, it recruited females aged between 20 and 74 years who were diagnosed with invasive breast cancer. African American females and females aged less than 50 years were oversampled. Cases were identified through a rapid case ascertainment system in cooperation with the North Carolina Central Cancer Registry. Controls were selected from the North Carolina Division of Motor Vehicles (for females younger than 65 years) and the United States Health Care Financing Administration (for females aged 65 and older). Controls were approximately frequency matched to cases by age and race. Blood samples were collected from participants with consent.

Women's Circle of Health Study (WCHS):
The WCHS is a case-control study established in 2003 in the New York City metropolitan area that, beginning in 2006, also recruited from 10 counties in New Jersey.²⁰ Eligible cases included females who were diagnosed with invasive breast cancer between 20 and 75 years of age and self-identified as European American or African American. Controls were initially identified through random digit dialing and were matched to cases by self-reported race and 5-year age categories. From 2009 to 2012, controls were recruited through community events, particularly through churches.²¹

Black Women's Health Study (BWHS):
The BWHS is a prospective cohort study that recruited approximately 59,000 African American females, aged 21-60 years, from all regions of the United States in 1995.²² Participants were enrolled by completing a postal health questionnaire and were followed by mail questionnaires every two years. DNA samples were obtained from BWHS participants (26,800 females) by the mouthwash-swish method, with all samples stored in freezers at -80°C.

MD Anderson Breast Cancer Study (MDABCS):
All breast cancer cases in MDABCS are newly registered, histologically confirmed breast cancer patients at MD Anderson Cancer Center.²³ Basic demographic and epidemiological information, including smoking, alcohol, education, and family history data, was collected as part of the institutional patient history database. Clinical data were abstracted from electronic medical records by clinical coding specialists. DNA was extracted from residual blood samples and banked in the institutional Blood Specimen Research Resource.

Wake Forest University Breast Cancer Study (WFBC):
The WFBC is a clinic-based case-control study conducted at Wake Forest University Health Sciences from 1998 to 2008.²⁸,²⁹ Incident breast cancer cases were recruited at the Wake Forest University Breast Care Center. Controls were recruited from the patient population receiving routine mammography at the Outpatient Radiology-Breast Screening Center. Blood samples (20 ml) were collected from all study subjects.

New York University Women's Health Study (NYUWHS):
The NYUWHS is a cohort study that enrolled 14,274 females aged 34 to 65 years attending the Guttman Breast Diagnostic Institute in New York City for yearly screening from 1985 to 1991.³⁰,³¹ Self-administered questionnaires were used to collect demographic, medical, anthropometric, reproductive, and dietary data. Non-fasting peripheral venous blood was drawn prior to breast examination, and serum samples were stored at -80°C for subsequent biochemical analyses. Up until 1991, females who returned to the clinic for annual breast cancer screening were asked to donate blood at each of their visits. Cases were breast cancer patients arising in the cohort, and controls were females selected from the same cohort who were not diagnosed with breast cancer and were matched to cases on age and follow-up time.

Barbados National Cancer Study (BNCS):
The BNCS is a population-based case-control study of incident breast and prostate cancer in the predominantly African population of Barbados, West Indies.³² Breast cancer cases were histologically confirmed incident cases identified through the only pathology department on the island, located at the Queen Elizabeth Hospital, between July 2002 and March 2006. Controls were selected from a national database provided by the Barbados Statistical Services Department and were frequency matched to breast cancer cases at a 2:1 ratio and by 5-year age groups. Blood samples were collected from participants.
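Frequency matching of the kind described above (a fixed control-to-case ratio within 5-year age groups) can be sketched as follows. This is an illustrative Python example only, not the procedure actually used by any of the studies: the function names, the random sampling step, the seed, and the input format are all assumptions.

```python
import random
from collections import Counter, defaultdict


def age_band(age, width=5):
    """Map an age to its age band, e.g. 52 -> (50, 54) for 5-year bands."""
    start = (age // width) * width
    return (start, start + width - 1)


def frequency_match(case_ages, control_pool, ratio=2, seed=0):
    """Select controls so each age band holds about `ratio` controls per case.

    case_ages: list of case ages (hypothetical input format).
    control_pool: list of (control_id, age) tuples for eligible controls.
    Returns the list of selected control IDs. If a band has too few
    eligible controls, all of them are taken.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    cases_per_band = Counter(age_band(age) for age in case_ages)

    pool_by_band = defaultdict(list)
    for control_id, age in control_pool:
        pool_by_band[age_band(age)].append(control_id)

    selected = []
    for band, n_cases in sorted(cases_per_band.items()):
        candidates = pool_by_band.get(band, [])
        n_wanted = min(ratio * n_cases, len(candidates))
        selected.extend(rng.sample(candidates, n_wanted))
    return selected
```

Unlike individual matching, frequency matching only balances the distribution of the matching factor between cases and controls, so no control is tied to a specific case.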

Racial Variability in Genotypic Determinants of Breast Cancer Risk Study (RVGBC):
The RVGBC is a hospital-based case-control study conducted in the Philadelphia and Detroit metropolitan areas from 1999 to 2003. Breast cancer cases were identified in the University of Pennsylvania Health System and the Karmanos Cancer Institute. Local advertisements were also placed to recruit breast cancer cases living in the Philadelphia and Detroit areas. Controls were recruited in the same way as cases except that they did not have breast cancer. Patients with breast cancer had to have been diagnosed within 18 months of recruitment and to have invasive ductal cancer. The study oversampled females diagnosed with breast cancer under the age of 40 years.

Baltimore Breast Cancer Study (BBCS):
The BBCS is a case-control study of breast cancer designed to identify and characterize markers of disease aggressiveness and poor outcome. From 1993 to 2003, incident breast cancer cases and controls were recruited from six hospitals in the greater Baltimore area, including the University of Maryland Medical Center, the Baltimore Veterans Affairs Medical Center, Union Memorial Hospital, Mercy Medical Center, and the Sinai Hospital. Controls were frequency matched to cases by race and age.

Northern California Breast Cancer Family Registry (NC-BCFR):
Incident breast cancer cases included females aged <65 years, identified through the SEER cancer registry of the Greater San Francisco Bay Area (diagnoses 1995-2009) and the Sacramento region (diagnoses 2005-2006).²⁴,²⁵ All cases with indicators of inherited breast cancer were included. Among cases aged 35-64 years without such indicators, cases from racial and ethnic minority populations were oversampled. Controls were identified through random digit dialing and frequency matched to cases diagnosed from 1995 to 1998 on 5-year age group and race/ethnicity, at a ratio of one control per two cases.

The Sister Study (SISTER):
The Sister Study is a prospective cohort study designed by the National Institute of Environmental Health Sciences to address genetic and environmental risk factors for breast cancer. From 2003 through 2009, 50,884 U.S. females, including Puerto Ricans, were recruited through a national multimedia campaign and a network of recruitment volunteers, breast cancer professionals, and advocates. Participants were females aged 35 to 74 years who had a sister diagnosed with breast cancer.²⁶ At enrollment, participants completed baseline questionnaires on medical and family history, lifestyle factors, and demographics. Blood samples were collected during a home visit by trained phlebotomists and shipped overnight to the Sister Study laboratory, where they were processed to obtain serum and stored at -80°C.

The Two Sister Study (2SISTER):
The Two Sister Study is a family-based retrospective study developed from the Sister Study. The Two Sister Study recruited the case sisters in the Sister Study who had been diagnosed within 4 years and were younger than 50 years at diagnosis.²⁷

Figure S2. eQTpLot for gene EN1 with ER-negative breast cancer risk from Exp-TWAS. eQTpLot plots show the colocalization between eQTLs for gene EN1 and GWAS signals for ER-negative breast cancer risk. A: Locus with EN1 gene. Points show variant p-values for ER-negative breast cancer risk (vertical axis) and EN1 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. The genome-wide significance threshold (5 × 10⁻⁸) is shown as a red line. B: Genomic positions of all genes within the locus. C: Enrichment of EN1 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for EN1 and ER-negative breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.
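The correlation shown in panel D of these figures can be illustrated with a minimal Python sketch of a Pearson correlation between GWAS and eQTL p-values for the same variants. It assumes that the p-values are compared on the -log10 scale, as is common for such plots; this is an assumption rather than a detail stated in the captions, and the function names are hypothetical.

```python
import math


def neg_log10(p_values):
    """Transform p-values to the -log10 scale (larger means more significant)."""
    return [-math.log10(p) for p in p_values]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical p-values for three variants tested in both GWAS and eQTL analyses.
p_gwas = [1e-2, 1e-4, 1e-6]
p_eqtl = [1e-1, 1e-2, 1e-3]
r = pearson_r(neg_log10(p_gwas), neg_log10(p_eqtl))
```

A strongly positive r indicates that variants with stronger eQTL evidence also tend to show stronger GWAS evidence, which is the pattern expected when the two signals colocalize.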

Figure S3. eQTpLot for gene LINC01956 with ER-negative breast cancer risk from Exp-TWAS. eQTpLot plots show the colocalization between eQTLs for gene LINC01956 and GWAS signals for ER-negative breast cancer risk. A: Locus with LINC01956 gene. Points show variant p-values for ER-negative breast cancer risk (vertical axis) and LINC01956 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. The genome-wide significance threshold (5 × 10⁻⁸) is shown as a red line. B: Genomic positions of all genes within the locus. C: Enrichment of LINC01956 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for LINC01956 and ER-negative breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S4. eQTpLot for gene CTD-3080P12.3 with ER-negative breast cancer risk from Exp-TWAS. eQTpLot plots show the colocalization between eQTLs for gene CTD-3080P12.3 and GWAS signals for ER-negative breast cancer risk. A: Locus with CTD-3080P12.3 gene. Points show variant p-values for ER-negative breast cancer risk (vertical axis) and CTD-3080P12.3 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. The genome-wide significance threshold (5 × 10⁻⁸) is shown as a red line. B: Genomic positions of all genes within the locus. C: Enrichment of CTD-3080P12.3 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for CTD-3080P12.3 and ER-negative breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S5. eQTpLot for gene EN1 with TNBC breast cancer risk from Exp-TWAS. eQTpLot plots show the colocalization between eQTLs for gene EN1 and GWAS signals for TNBC breast cancer risk. A: Locus with EN1 gene. Points show variant p-values for TNBC breast cancer risk (vertical axis) and EN1 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. The genome-wide significance threshold (5 × 10⁻⁸) is shown as a red line. B: Genomic positions of all genes within the locus. C: Enrichment of EN1 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for EN1 and TNBC breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S6. eQTpLot for gene LINC01956 with TNBC breast cancer risk from Exp-TWAS. eQTpLot plots show the colocalization between eQTLs for gene LINC01956 and GWAS signals for TNBC breast cancer risk. A: Locus with LINC01956 gene. Points show variant p-values for TNBC breast cancer risk (vertical axis) and LINC01956 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. The genome-wide significance threshold (5 × 10⁻⁸) is shown as a red line. B: Genomic positions of all genes within the locus. C: Enrichment of LINC01956 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for LINC01956 and TNBC breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S7. eQTpLot for gene MRPL34 with TNBC breast cancer risk from Exp-TWAS. eQTpLot plots show the colocalization between eQTLs for gene MRPL34 and GWAS signals for TNBC breast cancer risk. A: Locus with MRPL34 gene. Points show variant p-values for TNBC breast cancer risk (vertical axis) and MRPL34 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. The genome-wide significance threshold (5 × 10⁻⁸) is shown as a red line. B: Genomic positions of all genes within the locus. C: Enrichment of MRPL34 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for MRPL34 and TNBC breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S8. eQTpLot for gene TET2 with overall breast cancer risk from APA-WAS. eQTpLot plots show the colocalization between eQTLs for gene TET2 and GWAS signals for overall breast cancer risk. A: Locus with TET2 gene. Points show variant p-values for overall breast cancer risk (vertical axis) and TET2 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. B: Genomic positions of all genes within the locus. C: Enrichment of TET2 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for TET2 and overall breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S9. eQTpLot for gene TET2 with ER-negative breast cancer risk from APA-WAS. eQTpLot plots show the colocalization between eQTLs for gene TET2 and GWAS signals for ER-negative breast cancer risk. A: Locus with TET2 gene. Points show variant p-values for ER-negative breast cancer risk (vertical axis) and TET2 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. B: Genomic positions of all genes within the locus. C: Enrichment of TET2 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for TET2 and ER-negative breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S10. eQTpLot for gene BRD9 with overall breast cancer risk from spTWAS. eQTpLot plots show the colocalization between eQTLs for gene BRD9 and GWAS signals for overall breast cancer risk. A: Locus with BRD9 gene. Points show variant p-values for overall breast cancer risk (vertical axis) and BRD9 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. The genome-wide significance threshold (5 × 10⁻⁸) is shown as a red line. B: Genomic positions of all genes within the locus. C: Enrichment of BRD9 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for BRD9 and overall breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S11. eQTpLot for gene NUP210L with overall breast cancer risk from spTWAS. eQTpLot plots show the colocalization between eQTLs for gene NUP210L and GWAS signals for overall breast cancer risk. A: Locus with NUP210L gene. Points show variant p-values for overall breast cancer risk (vertical axis) and NUP210L expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. B: Genomic positions of all genes within the locus. C: Enrichment of NUP210L eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for NUP210L and overall breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S12. eQTpLot for gene BRD9 with ER-negative breast cancer risk from spTWAS. eQTpLot plots show the colocalization between eQTLs for gene BRD9 and GWAS signals for ER-negative breast cancer risk. A: Locus with BRD9 gene. Points show variant p-values for ER-negative breast cancer risk (vertical axis) and BRD9 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. The genome-wide significance threshold (5 × 10⁻⁸) is shown as a red line. B: Genomic positions of all genes within the locus. C: Enrichment of BRD9 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for BRD9 and ER-negative breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Figure S13. eQTpLot for gene BRD9 with TNBC breast cancer risk from spTWAS. eQTpLot plots show the colocalization between eQTLs for gene BRD9 and GWAS signals for TNBC breast cancer risk. A: Locus with BRD9 gene. Points show variant p-values for TNBC breast cancer risk (vertical axis) and BRD9 expression (color scale). Triangles indicate GWAS effect direction and eQTL effect size. The genome-wide significance threshold (5 × 10⁻⁸) is shown as a red line. B: Genomic positions of all genes within the locus. C: Enrichment of BRD9 eQTLs among GWAS-significant variants. D: Correlation between P_GWAS and P_eQTL for BRD9 and TNBC breast cancer risk, with the computed Pearson correlation coefficient (r) and p-value (p) displayed on the plot.

Table S2. Sample sizes of studies contributing to the genome-wide association analysis.
a Studies with fewer than 10 subtype cases were not included in subtype analyses.
b In association analyses, the study covariate was coded as GBHS, MEC, or other studies in the WGS dataset.
c Matched controls from SCCS.
d In association analyses, the study covariate was coded as WCHS or other studies in the AMBER dataset.

Table S3. Results from permutation tests and conditional analyses.

Table S4. Association results for genes identified in our recent TWAS conducted among females of Asian and European ancestry.

Table S5. Association results for the lead variants within ±500 kb of the 29 previously reported breast cancer-associated genes replicated in this study. Results are from the African-ancestry Breast Cancer Genetic Study.
b Effect allele frequency among controls.